Masked Modeling (MM) has demonstrated widespread success in various vision challenges by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random masking paradigm used for 2D images often carries a high risk of ambiguity when recovering the masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction scheme, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. Besides, scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring the model to learn consistent representations from unmasked areas. By combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.
translated by 谷歌翻译
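We extend to the functional setting methods that were originally developed in the areas of multidimensional scaling and dimensionality reduction for multivariate data. We focus on classical scaling and Isomap, prototypical methods that have played important roles in these areas, and showcase their use in the context of functional data analysis. In the process, we highlight the crucial role played by the ambient metric.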
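As a rough illustration of classical scaling applied to functional data, the sketch below embeds discretized curves from their pairwise distances. The choice of the L2 distance as ambient metric, the uniform grid, the toy curves, and the helper names `classical_scaling` and `l2_distances` are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def classical_scaling(D, k=2):
    """Classical (Torgerson) scaling: embed n items in R^k from an n x n distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]             # indices of the k largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)        # guard against tiny negative values
    return V[:, idx] * np.sqrt(w_top)

def l2_distances(curves, grid):
    """Pairwise L2 distances between curves sampled on a uniform grid (Riemann sum)."""
    dx = grid[1] - grid[0]
    diff = curves[:, None, :] - curves[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) * dx)

# Toy functional data (assumed): vertical shifts of a single sine curve.
grid = np.linspace(0.0, 1.0, 200)
shifts = np.linspace(-1.0, 1.0, 15)
curves = np.array([np.sin(2 * np.pi * grid) + s for s in shifts])

D = l2_distances(curves, grid)   # the ambient metric enters only through D
Y = classical_scaling(D, k=1)    # one-dimensional embedding of the 15 curves
```

Because these toy curves differ only by a constant shift, their L2 distances are exactly one-dimensional, so a single coordinate reproduces them. Swapping `l2_distances` for another metric changes the embedding, which is one way to see why the ambient metric matters.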
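Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, which hinders the development of tracking within more general systems. This paper presents a Simplified Tracking architecture (SimTrack) that leverages a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction inside the backbone removes the need for carefully designed interaction modules and yields a more efficient and effective framework. To reduce the information loss caused by down-sampling in vision transformers, we further propose a foveal window strategy, which provides more diverse input patches at an acceptable computational cost. Our SimTrack improves the baseline by 2.5%/2.6% AUC on LaSOT/TNL2K and achieves results competitive with other specialized tracking algorithms without bells and whistles.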
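The serialize-and-concatenate step can be pictured with a small sketch. The image sizes, the patch size of 16, and the helper name `patchify` are hypothetical choices for illustration; the actual SimTrack tokenization, embedding, and positional encoding are not specified in this abstract.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, one flat token per patch."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

rng = np.random.default_rng(0)
template = rng.random((64, 64, 3))     # hypothetical template crop
search = rng.random((128, 128, 3))     # hypothetical search region

# Serialize both images and concatenate them into one token sequence,
# which a single-branch backbone could then process jointly.
tokens = np.concatenate([patchify(template, 16), patchify(search, 16)], axis=0)
```

With a joint sequence like `tokens`, self-attention in the backbone can mix template and search features without a separate interaction module.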
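Two important nonparametric approaches to clustering emerged in the 1970s: clustering by level sets or cluster tree, as proposed by Hartigan, and clustering by gradient lines or gradient flow, as proposed by Fukunaga and Hostetler. In a recent paper, we argued that these two approaches are fundamentally the same by showing that the gradient flow provides a way to move along the cluster tree. In making a stronger case, we are confronted with the fact that the cluster tree does not define a partition of the entire support of the underlying density, while the gradient flow does. In the present paper, we resolve this conundrum by proposing two ways of obtaining a partition from the cluster tree, each of them very natural in its own right, and showing that both reduce to the partition given by the gradient flow under standard assumptions on the sampling density.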
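This paper establishes a strong correspondence between two important clustering approaches that emerged in the 1970s: clustering by level sets or cluster tree, as proposed by Hartigan, and clustering by gradient lines or gradient flow, as proposed by Fukunaga and Hostetler. We do so by showing that we can move up the cluster tree by following the gradient ascent flow.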
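To make the gradient-flow side of this correspondence concrete, here is a minimal sketch of a mean-shift iteration, which with a Gaussian kernel acts as an adaptive-step gradient ascent on a kernel density estimate. The toy data, the bandwidth of 0.5, and the function name `mean_shift` are assumptions for illustration only; the paper's analysis concerns the population density, not this estimator.

```python
import numpy as np

def mean_shift(points, data, bandwidth, n_iter=200):
    """Move each point uphill on a Gaussian kernel density estimate built from `data`."""
    x = points.astype(float).copy()
    for _ in range(n_iter):
        # Gaussian kernel weights between the current positions and the data.
        w = np.exp(-0.5 * ((x[:, None] - data[None, :]) / bandwidth) ** 2)
        # Weighted mean of the data: one mean-shift (gradient ascent) step.
        x = (w * data[None, :]).sum(axis=1) / w.sum(axis=1)
    return x

# Toy one-dimensional sample (assumed): two well-separated groups.
data = np.concatenate([np.linspace(-1.0, 1.0, 50), np.linspace(4.0, 6.0, 50)])

modes = mean_shift(data, data, bandwidth=0.5)   # end point of each ascent path
labels = (modes > 2.5).astype(int)              # cluster = which mode was reached
```

Each point is labeled by the mode its ascent path reaches, so the flow partitions the whole sample, which is the property contrasted with the cluster tree in the neighboring abstract.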
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which holds back graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
With the development of technology and the sharing economy, Airbnb, a well-known short-term rental platform, has become the first choice for many young people. Airbnb pricing has long been a problem worth studying. While previous studies achieve promising results, several deficiencies remain: (1) the feature attributes of rentals are not rich enough; (2) the research on rental text information is not deep enough; (3) few studies predict the rental price in combination with the points of interest (POI) around the house. To address these challenges, we propose a multi-source information embedding (MSIE) model to predict Airbnb rental prices. Specifically, we first select statistical features to embed the original rental data. Secondly, we generate word feature vectors and emotional scores from three different kinds of text information and combine them to form the text feature embedding. Thirdly, we use the points of interest (POI) around the rental house to generate a variety of spatial network graphs, and learn embeddings of these networks to obtain the spatial feature embedding. Finally, we combine the three modules into multi-source rental representations and use a fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
In this paper, we investigate the feasibility of a backward-differential-flow-like algorithm that starts from the minimum of a convexified version of the polynomial. We apply the heat evolution convexification approach through Gaussian filtering, which is in effect an accumulated version of Steklov's regularization. We generalize the fingerprint theory proposed in computer vision by A.L. Yuille and T. Poggio in the 1980s, in particular their fingerprint trajectory equation, to characterize the evolution of minimizers across scales. In addition, we propose the "seesaw" polynomials $p(x|s)$ and find a seesaw differential equation $\frac{\partial p(x|s)}{\,ds}=-\frac{1}{p''(x)}$ that characterizes the evolution of the global minimizer $x^*(s)$ of $p(x|s)$ as $s$ varies. Essentially, both fingerprints $\mathcal{FP}_2$ and $\mathcal{FP}_3$ of $p(x)$, consisting of the zeros of $\frac{\partial^2 p(x,t)}{\partial x^2}$ and $\frac{\partial^3 p(x,t)}{\partial x^3}$, respectively, are independent of the seesaw coefficient $s$, and on this basis we define the Confinement Zone and the Escape Zone. Meanwhile, varying $s$ monotonically shifts the location of the global minimizer of $p(x|s)$, and all these locations form the Attainable Zone. Based on these concepts, we prove that the global minimizer $x^*$ of $p(x)$ can be inversely evolved from the global minimizer of its convexification polynomial $p(x,t_0)$ if and only if $x^*$ is included in the Escape Zone. In particular, we give a detailed analysis for quartic and sixth-degree polynomials.
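The idea of convexifying by heat evolution and then tracking the minimizer back to scale zero can be sketched numerically. This is only an illustration under stated assumptions: the heat equation is taken as $u_t = u_{xx}$ (the paper's normalization may differ), the quartic $p(x) = x^4 - 4x^2 + x$ and $t_0 = 0.5$ are made-up choices, and `heat_evolve` and `track_minimizer` are hypothetical helper names. As the abstract states, such backward tracking reaches the global minimizer only in some cases; this toy polynomial happens to be one of them.

```python
import numpy as np
from numpy.polynomial import Polynomial

def heat_evolve(p, t):
    """p(., t) for u_t = u_xx with u(., 0) = p, via the finite sum of t^k / k! * d^(2k) p / dx^(2k)."""
    out = Polynomial([0.0])
    term, coef, k = p, 1.0, 0
    while np.any(term.coef != 0):
        out = out + coef * term
        term = term.deriv(2)       # next even-order derivative
        k += 1
        coef *= t / k              # coef becomes t^k / k!
    return out

def track_minimizer(p, t0, steps=200, newton_iters=5):
    """Start at the minimizer of the convex p(., t0) and follow it as t decreases to 0."""
    roots = heat_evolve(p, t0).deriv().roots()
    x = roots[np.abs(roots.imag) < 1e-9].real[0]    # unique real critical point
    for t in np.linspace(t0, 0.0, steps + 1)[1:]:
        pt = heat_evolve(p, t)
        d1, d2 = pt.deriv(1), pt.deriv(2)
        for _ in range(newton_iters):               # Newton correction on p_x(x, t) = 0
            x = x - d1(x) / d2(x)
    return x

# Assumed example: p(x) = x^4 - 4x^2 + x (coefficients in increasing degree).
p = Polynomial([0.0, 1.0, -4.0, 0.0, 1.0])
t0 = 0.5                                            # p_xx(x, t0) = 12x^2 + 4 > 0, so convex
x_tracked = track_minimizer(p, t0)

# Reference: global minimizer found by checking all real critical points of p.
crit = p.deriv().roots()
crit = crit[np.abs(crit.imag) < 1e-9].real
x_global = crit[np.argmin(p(crit))]
```

For this quartic, $p(x,t) = p(x) + t\,p''(x) + \tfrac{t^2}{2}\,p''''(x)$, which is convex once $t \ge 1/3$, so $t_0 = 0.5$ gives a single starting minimizer.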
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
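The output-score idea, where a query-caption branch complements the query-video branch, can be illustrated with a simple weighted fusion of cosine similarities. The weighted-sum rule, the weight `alpha`, the toy embeddings, and the function names are assumptions for this sketch; the abstract does not say how Cap4Video actually combines the two scores.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fused_scores(query, video, caption, alpha=0.5):
    """Blend query-video and query-caption similarity matrices (hypothetical fusion rule)."""
    q, v, c = (l2_normalize(a) for a in (query, video, caption))
    s_qv = q @ v.T     # query-video matching branch
    s_qc = q @ c.T     # query-caption matching branch
    return (1.0 - alpha) * s_qv + alpha * s_qc

# Toy embeddings (assumed): one query, two candidate videos with one caption each.
query = np.array([[1.0, 0.0]])
video = np.array([[0.6, 0.8], [0.8, 0.6]])
caption = np.array([[1.0, 0.0], [0.0, 1.0]])

video_only = fused_scores(query, video, caption, alpha=0.0)
scores = fused_scores(query, video, caption, alpha=0.5)
```

In this toy case the video-only scores rank the second video first, while the fused scores rank the first video first because its caption matches the query.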